Deep Bayesian Neural Nets as Deep Matrix Gaussian Processes
Authors
Abstract
We show that by employing a distribution over random matrices, the matrix variate Gaussian (Gupta & Nagar, 1999), for the neural network parameters, we can obtain a non-parametric interpretation of the hidden units after the application of the "local reparametrization trick" (Kingma et al., 2015). This provides a nice duality between Bayesian neural networks and deep Gaussian processes (Damianou & Lawrence, 2013), a property that was also shown by Gal & Ghahramani (2015). We show that we can borrow ideas from the Gaussian process literature so as to exploit the non-parametric properties of such a model. We empirically verify this model on a regression task.

1 MATRIX-VARIATE GAUSSIAN

The matrix variate Gaussian (Gupta & Nagar, 1999) is a three-parameter distribution that governs a random matrix, e.g. W:

p(W) = \mathcal{MN}(M, U, V) = \frac{\exp\left(-\frac{1}{2}\operatorname{tr}\left[V^{-1}(W - M)^{T} U^{-1}(W - M)\right]\right)}{(2\pi)^{rc/2}\,|V|^{r/2}\,|U|^{c/2}}    (1)

where M is an r × c matrix that is the mean of the distribution, U is an r × r matrix that provides the covariance of the rows, and V is a c × c matrix that governs the covariance of the columns of the matrix. According to Gupta & Nagar (1999), this distribution is essentially a multivariate Gaussian distribution over the "flattened" matrix W: p(\operatorname{vec}(W)) = \mathcal{N}(\operatorname{vec}(M), V \otimes U), where vec(·) is the vectorization operator (i.e. stacking the columns into a single vector) and ⊗ is the Kronecker product.

2 BAYESIAN NEURAL NETS WITH MATRIX-VARIATE GAUSSIANS

For the following derivation we assume that each input to a layer is augmented with an extra dimension containing 1's so as to account for the biases; thus we only deal with weights W on this expanded input. In order to obtain a matrix variate Gaussian posterior distribution for these weights we perform variational inference, and the derivation is straightforward and similar to (Graves, 2011; Kingma & Welling, 2014; Blundell et al., 2015; Kingma et al., 2015). Let p_\theta(W), q_\phi(W) be a matrix variate Gaussian prior and posterior distribution with parameters \theta, \phi respectively, and let (x_i, y_i)_{i=1}^{N} be the training data sampled from the empirical distribution \tilde{p}(x, y). Then the following lower bound on the marginal log-likelihood can be derived:

E_{\tilde{p}(x,y)}[\log p(Y|X)] \geq E_{\tilde{p}(x,y)}\big[ E_{q_\phi(W)}[\log p(Y|X, W)] - \operatorname{KL}(q_\phi(W)\,\|\,p_\theta(W)) \big]    (2)
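As a concrete illustration of the pieces above, the sketch below (a minimal NumPy example, not taken from the paper; the dimensions, variable names, and the standard-normal prior are assumptions made for illustration) samples a weight matrix from a matrix variate Gaussian, applies the local reparametrization trick by sampling the pre-activation a = W^T x directly from its induced Gaussian N(M^T x, (x^T U x) V), and evaluates the KL term of the bound in Eq. (2) against an isotropic Gaussian prior over vec(W).

import numpy as np

rng = np.random.default_rng(0)

def sample_matrix_normal(M, U, V):
    # W = M + A E B^T with E ~ N(0, I), U = A A^T, V = B B^T
    # gives W ~ MN(M, U, V), i.e. vec(W) ~ N(vec(M), V kron U).
    A, B = np.linalg.cholesky(U), np.linalg.cholesky(V)
    E = rng.standard_normal(M.shape)
    return M + A @ E @ B.T

def local_reparam_preactivation(x, M, U, V):
    # Local reparametrization: for W ~ MN(M, U, V) the pre-activation
    # a = W^T x is Gaussian with mean M^T x and covariance (x^T U x) V,
    # so we can sample a directly instead of sampling W.
    mean = M.T @ x
    scale = float(x @ U @ x)              # scalar x^T U x
    L = np.linalg.cholesky(scale * V)
    return mean + L @ rng.standard_normal(mean.shape)

def kl_to_standard_normal(M, U, V):
    # KL(MN(M, U, V) || N(0, I)) over vec(W), using
    # tr(V kron U) = tr(U) tr(V) and log|V kron U| = r log|V| + c log|U|.
    r, c = M.shape
    logdet = r * np.linalg.slogdet(V)[1] + c * np.linalg.slogdet(U)[1]
    return 0.5 * (np.trace(U) * np.trace(V) + np.sum(M ** 2) - r * c - logdet)

# Toy layer: r = 4 inputs (last entry of x is the constant 1 for the bias),
# c = 3 hidden units.
r, c = 4, 3
M = rng.standard_normal((r, c))
U = 0.5 * np.eye(r)
V = 0.3 * np.eye(c)
x = np.array([0.2, -1.0, 0.7, 1.0])

W = sample_matrix_normal(M, U, V)
a = local_reparam_preactivation(x, M, U, V)
kl = kl_to_standard_normal(M, U, V)

Note that, under this posterior, the cross-covariance of pre-activations for two inputs is Cov(a(x), a(x')) = (x^T U x') V, so the hidden units of a layer are jointly Gaussian with an input-dependent kernel; this is the Gaussian-process view of the hidden units referred to in the abstract.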
Similar resources
Deep Neural Networks as Gaussian Processes
A deep fully-connected neural network with an i.i.d. prior over its parameters is equivalent to a Gaussian process (GP) in the limit of infinite network width. This correspondence enables exact Bayesian inference for neural networks on regression tasks by means of straightforward matrix computations. For single hiddenlayer networks, the covariance function of this GP has long been known. Recent...
Scalable Gaussian Process Regression Using Deep Neural Networks
We propose a scalable Gaussian process model for regression by applying a deep neural network as the feature-mapping function. We first pre-train the deep neural network with a stacked denoising auto-encoder in an unsupervised way. Then, we perform a Bayesian linear regression on the top layer of the pre-trained deep network. The resulting model, Deep-Neural-Network-based Gaussian Process (DNN-...
Deep Gaussian Processes for Regression using Approximate Expectation Propagation
Deep Gaussian processes (DGPs) are multi-layer hierarchical generalisations of Gaussian processes (GPs) and are formally equivalent to neural networks with multiple, infinitely wide hidden layers. DGPs are nonparametric probabilistic models and as such are arguably more flexible, have a greater capacity to generalise, and provide better calibrated uncertainty estimates than alternative deep mod...
Wide Deep Neural Networks
Whilst deep neural networks have shown great empirical success, there is still much work to be done to understand their theoretical properties. In this paper, we study the relationship between Gaussian processes with a recursive kernel definition and random wide fully connected feedforward networks with more than one hidden layer. We exhibit limiting procedures under which finite deep networks ...
Structured and Efficient Variational Deep Learning with Matrix Gaussian Posteriors
We introduce a variational Bayesian neural network where the parameters are governed via a probability distribution on random matrices. Specifically, we employ a matrix variate Gaussian (Gupta & Nagar, 1999) parameter posterior distribution where we explicitly model the covariance among the input and output dimensions of each layer. Furthermore, with approximate covariance matrices we can achie...
Publication date: 2016